AITopics | side task

Collaborating Authors

side task

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CoTRed-Handed: Stress Testing Chain-of-Thought Monitoring

Neural Information Processing SystemsJun-16-2026, 23:13:34 GMT

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice higher than action-only monitoring for subtle deception scenarios.

large language model, machine learning, side task, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law Enforcement & Public Safety > Terrorism (0.55)
Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Arnav, Benjamin, Bernabeu-Pérez, Pablo, Helm-Burger, Nathan, Kostolansky, Tim, Whittingham, Hannes, Phuong, Mary

arXiv.org Artificial IntelligenceNov-26-2025

large language model, machine learning, side task, (18 more...)

arXiv.org Artificial Intelligence

2505.23575

Genre: Research Report (0.52)

Industry: Law Enforcement & Public Safety > Terrorism (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.73)

Add feedback

Optimizing AI Agent Attacks With Synthetic Data

Loughridge, Chloe, Colognese, Paul, Griffin, Avery, Tracy, Tyler, Kutasov, Jon, Benton, Joe

arXiv.org Artificial IntelligenceNov-5-2025

As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.

attack policy, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2511.02823

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.75)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Terekhov, Mikhail, Panfilov, Alexander, Dzenhaliou, Daniil, Gulcehre, Caglar, Andriushchenko, Maksym, Prabhu, Ameya, Geiping, Jonas

arXiv.org Artificial IntelligenceOct-13-2025

AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.

large language model, machine learning, side task, (22 more...)

arXiv.org Artificial Intelligence

2510.09462

Country: Asia (0.46)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Reliable Weak-to-Strong Monitoring of LLM Agents

Kale, Neil, Zhang, Chen Bo Calvin, Zhu, Kevin, Aich, Ankit, Rodriguez, Paula, Team, Scale Red, Knight, Christina Q., Wang, Zifan

arXiv.org Artificial IntelligenceAug-28-2025

We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's reliability. On the contrary, providing the monitor with more information about the agent is less helpful than expected. Second, monitor scaffolding matters more than monitor awareness: the hybrid scaffolding consistently outperforms baseline monitor scaffolding, and can enable weaker models to reliably monitor stronger agents -- a weak-to-strong scaling effect. Third, in a human-in-the-loop setting where humans discuss with the LLM monitor to get an updated judgment for the agent's behavior, targeted human oversight is most effective; escalating only pre-flagged cases to human reviewers improved the TPR by approximately 15% at FPR = 0.01. Our work establishes a standard workflow for MRT, highlighting the lack of adversarial robustness for LLMs and humans when monitoring and detecting agent misbehavior. We release code, data, and logs to spur further research.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.19461

Country: North America > United States (0.28)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)
Transportation (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Add feedback

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

Kutasov, Jonathan, Sun, Yuqi, Colognese, Paul, van der Weij, Teun, Petrini, Linda, Zhang, Chen Bo Calvin, Hughes, John, Deng, Xiang, Sleight, Henry, Tracy, Tyler, Shlegeris, Buck, Benton, Joe

arXiv.org Artificial IntelligenceJul-10-2025

As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid detection, we find that the best performing frontier models score 27% (Claude 3.7 Sonnet) and 15% (Gemini 2.5 Pro) as sabotage agents when overseen by Claude 3.6 Sonnet. For current frontier models, success on the side task relies heavily on having access to a hidden scratchpad that is not visible to the monitor. We also use SHADE-Arena to measure models' monitoring abilities, with the top monitor (Gemini 2.5 Pro) achieving an AUC of 0.87 at distinguishing benign and malign transcripts. We find that for now, models still struggle at sabotage due to failures in long-context main task execution. However, our measurements already demonstrate the difficulty of monitoring for subtle sabotage attempts, which we expect to only increase in the face of more complex and longer-horizon tasks.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2506.1574

Genre:

Research Report (1.00)
Workflow (0.67)

Industry:

Law Enforcement & Public Safety > Terrorism (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

FreeRide: Harvesting Bubbles in Pipeline Parallelism

Zhang, Jiashu, Pan, Zihan, Molly, null, Xu, null, Daudjee, Khuzaima, Liu, Sihang

arXiv.org Artificial IntelligenceSep-10-2024

The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM training. Harvesting these bubbles for GPU side tasks can increase resource utilization and reduce training costs but comes with challenges. First, because bubbles are discontinuous with various shapes, programming side tasks becomes difficult while requiring excessive engineering effort. Second, a side task can compete with pipeline training for GPU resources and incur significant overhead. To address these challenges, we propose FreeRide, a system designed to harvest bubbles in pipeline parallelism for side tasks. FreeRide provides programmers with interfaces to implement side tasks easily, manages bubbles and side tasks during pipeline training, and controls access to GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide achieves 7.8% average cost savings with a negligible overhead of about 1% in training LLMs while serving model training, graph analytics, and image processing side tasks.

freeride, interface, side task, (14 more...)

arXiv.org Artificial Intelligence

2409.06941

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

PATO: Policy Assisted TeleOperation for Scalable Robot Data Collection

Dass, Shivin, Pertsch, Karl, Zhang, Hejia, Lee, Youngwoon, Lim, Joseph J., Nikolaidis, Stefanos

arXiv.org Artificial IntelligenceJun-1-2023

Large-scale data is an essential component of machine learning as demonstrated in recent advances in natural language processing and computer vision research. However, collecting large-scale robotic data is much more expensive and slower as each operator can control only a single robot at a time. To make this costly data collection process efficient and scalable, we propose Policy Assisted TeleOperation (PATO), a system which automates part of the demonstration collection process using a learned assistive policy. PATO autonomously executes repetitive behaviors in data collection and asks for human input only when it is uncertain about which subtask or behavior to execute. We conduct teleoperation user studies both with a real robot and a simulated robot fleet and demonstrate that our assisted teleoperation system reduces human operators' mental load while improving data collection efficiency. Further, it enables a single operator to control multiple robots in parallel, which is a first step towards scalable robotic data collection. For code and video results, see https://clvrai.com/pato

artificial intelligence, machine learning, robot, (18 more...)

arXiv.org Artificial Intelligence

2212.04708

Country:

North America > United States (0.14)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.35)

Add feedback